A Bandit Framework for Strategic Regression
We consider a learner's problem of acquiring data dynamically for training a regression model, where the training data are collected from strategic data sources. A fundamental challenge is to incentivize data holders to exert effort to improve the quality of their reported data, even though that quality is not directly verifiable by the learner. In this work, we study a dynamic data acquisition process in which data holders can contribute multiple times.
Reviews: A Bandit Framework for Strategic Regression
The question studied in the paper is interesting, and borrowing the idea from peer prediction to use the other arms' predictions as an unbiased estimator of the quality of one arm's prediction is a nice idea (in particular, because those arms need to be incentivized enough to make reasonably accurate predictions). However, the paper focuses too much on presenting "bells and whistles" rather than giving a deeper understanding of the basic (and main) results. Perhaps reorganizing the paper to only briefly mention the computational/privacy-aware variants and giving both more intuition and technical content describing the main result (namely, that there exist \alpha-BNE with small regret) would focus the paper and give the reader a cleaner message of what the paper is doing. This tack would have the added benefit that the reader might be able to better assess the "quantitative" consequences of this work, in that it would leave more room for the authors to ruminate on how much better or worse these bounds are than what one could get in the non-strategic setting, or in various trivial simplifications/special cases of this model. As the paper stands, this reviewer finds it difficult to assess from the main body of the paper alone the technical contribution of the paper (and whether the results follow from a mild reworking of standard proofs or require substantially new ideas). It is also difficult to assess a theory paper that gives not even a sketch or an outline of a proof in the main body.
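To make the peer-prediction idea the review highlights concrete: below is a minimal sketch, assuming squared-error scoring, in which each worker's report is scored against the leave-one-out mean of the other workers' reports, which stands in for the unverifiable ground truth when peers report accurately. The function and example values are illustrative, not taken from the paper.

```python
import numpy as np

def peer_quality_scores(predictions: np.ndarray) -> np.ndarray:
    """Score each worker's report against the leave-one-out mean of the
    other workers' reports, an (approximately) unbiased stand-in for the
    unknown ground truth when peers report accurately.
    Returns squared-error scores (lower = better) per worker.
    """
    n = len(predictions)
    total = predictions.sum()
    peer_means = (total - predictions) / (n - 1)  # leave-one-out peer mean
    return (predictions - peer_means) ** 2

# Example: worker 2 is noisier than the rest and gets the largest penalty.
preds = np.array([1.02, 0.98, 1.55, 1.01])
print(peer_quality_scores(preds))
```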
Dynamic Pricing in Securities Lending Market: Application in Revenue Optimization for an Agent Lender Portfolio
Xu, Jing, Hsu, Yung Cheng, Biscarri, William
Securities lending is an important part of the financial market structure, where agent lenders help long-term institutional investors lend out their securities to short sellers in exchange for a lending fee. Agent lenders within the market seek to optimize revenue by lending out securities at the highest rate possible. Typically, this rate is set by hard-coded business rules or standard supervised machine learning models. These approaches are often difficult to scale and are not adaptive to changing market conditions. Unlike a traditional stock exchange with a centralized limit order book, the securities lending market is organized similarly to an e-commerce marketplace, where agent lenders and borrowers can transact at any agreed price in a bilateral fashion. This similarity suggests that typical methods for addressing dynamic pricing problems in e-commerce could be effective in the securities lending market. We show that existing contextual bandit frameworks can be successfully applied in this setting. Using offline evaluation on real historical data, we show that the contextual bandit approach consistently outperforms typical approaches by at least 15% in terms of total revenue generated.
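The abstract does not name a specific algorithm, so the following is only a minimal sketch of how a standard contextual bandit (LinUCB over a discretized grid of candidate lending rates) could be applied to this pricing problem; the rate grid, feature names, and reward definition are all assumptions for illustration.

```python
import numpy as np

class LinUCBPricer:
    """LinUCB over a discretized grid of candidate lending rates, with one
    ridge-regression model per rate. The context would hold security-level
    features (e.g. utilization, borrower demand) -- illustrative only.
    """
    def __init__(self, rates, dim, alpha=1.0):
        self.rates = rates
        self.alpha = alpha
        self.A = [np.eye(dim) for _ in rates]    # per-arm covariance
        self.b = [np.zeros(dim) for _ in rates]  # per-arm response vector

    def choose(self, x):
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                    # ridge estimate
            scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))            # index into self.rates

    def update(self, arm, x, revenue):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += revenue * x

pricer = LinUCBPricer(rates=[0.1, 0.5, 1.0, 2.0], dim=3)
x = np.array([0.7, 0.2, 1.0])                    # toy context features
arm = pricer.choose(x)
pricer.update(arm, x, revenue=0.5 * pricer.rates[arm])
```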
DISCO: An End-to-End Bandit Framework for Personalised Discount Allocation
Zhang, Jason Shuo, Howson, Benjamin, Savva, Panayiota, Loh, Eleanor
Personalised discount codes provide a powerful mechanism for managing customer relationships and operational spend in e-commerce. Bandits are well suited for this product area, given the partial-information nature of the problem, as well as the need for adaptation to the changing business environment. Here, we introduce DISCO, an end-to-end contextual bandit framework for personalised discount code allocation at ASOS. DISCO adapts the traditional Thompson Sampling algorithm by integrating it within an integer program, thereby allowing for operational cost control. Because bandit learning often degrades with high-dimensional actions, we focused on building low-dimensional action and context representations that were nonetheless capable of good accuracy. Additionally, we sought to build a model that preserved the relationship between price and sales, in which customers increase their purchases in response to lower prices ("negative price elasticity"). These aims were achieved by using radial basis functions to represent the continuous (i.e. infinite-armed) action space, in combination with context embeddings extracted from a neural network. These feature representations were used within a Thompson Sampling framework to facilitate exploration, and further integrated with an integer program to allocate discount codes across ASOS's customer base. These modelling decisions result in a reward model that (a) enables pooled learning across similar actions, (b) is highly accurate, including in extrapolation, and (c) preserves the expected negative price elasticity. Through offline analysis, we show that DISCO is able to effectively enact exploration and improves its performance over time, despite the global constraint. Finally, we subjected DISCO to a rigorous online A/B test and found that it achieves a significant improvement of >1% in average basket value, relative to the legacy systems.
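A minimal sketch of the two modelling ideas named above: RBF features over a continuous discount action combined with Thompson Sampling on a Bayesian linear reward model. The centers, kernel width, prior, and reward signal are illustrative assumptions (not ASOS's settings), and the integer-program layer for cost control is omitted.

```python
import numpy as np

class RBFThompson:
    """Thompson Sampling with a Bayesian linear reward model over RBF
    features of a continuous discount action. Because nearby discounts
    activate overlapping basis functions, learning is pooled across
    similar actions. All hyperparameters here are assumptions.
    """
    def __init__(self, centers, width=0.05, noise=1.0):
        self.centers, self.width, self.noise = centers, width, noise
        d = len(centers)
        self.Sigma_inv = np.eye(d)   # prior precision (unit-variance prior)
        self.mu_b = np.zeros(d)      # precision-weighted mean

    def features(self, discount):
        return np.exp(-((discount - self.centers) ** 2)
                      / (2 * self.width ** 2))

    def choose(self, candidates):
        Sigma = np.linalg.inv(self.Sigma_inv)
        w = np.random.multivariate_normal(Sigma @ self.mu_b, Sigma)
        feats = np.stack([self.features(d) for d in candidates])
        return candidates[int(np.argmax(feats @ w))]  # one posterior sample

    def update(self, discount, reward):
        phi = self.features(discount)
        self.Sigma_inv += np.outer(phi, phi) / self.noise
        self.mu_b += reward * phi / self.noise

agent = RBFThompson(centers=np.linspace(0.0, 0.3, 7))
d = agent.choose(np.linspace(0.0, 0.3, 31))  # continuous action, discretized
agent.update(d, reward=1.0)                  # e.g. observed basket-value signal
```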
A Bandit Framework for Strategic Regression
We consider a learner's problem of acquiring data dynamically for training a regression model, where the training data are collected from strategic data sources. A fundamental challenge is to incentivize data holders to exert effort to improve the quality of their reported data, even though that quality is not directly verifiable by the learner. In this work, we study a dynamic data acquisition process in which data holders can contribute multiple times. We propose the Strategic Regression-Upper Confidence Bound (SR-UCB) framework, a UCB-style index combined with a simple payment rule, where the index of a worker approximates the quality of that worker's past contributions and is used by the learner to determine whether the worker receives future work. For linear regression and a certain family of non-linear regression problems, we show that SR-UCB enables an $O(\sqrt{\log T/T})$-Bayesian Nash Equilibrium (BNE) in which each worker exerts a target effort level chosen by the learner, with $T$ being the number of data acquisition stages.
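A schematic sketch of an SR-UCB-style loop, under stated assumptions: each worker carries an index built from peer-based quality scores of past contributions plus a UCB bonus, and only workers whose index clears a threshold continue to receive work. The threshold, bonus form, and flat payment below are illustrative; the paper's exact index and payment rule are not reproduced here.

```python
import numpy as np

class SRUCBLearner:
    """Schematic SR-UCB-style loop. Each worker's index is the mean of
    peer-based quality scores of past contributions plus a UCB bonus;
    workers whose index clears the threshold keep receiving work and a
    flat per-task payment. All constants here are assumptions.
    """
    def __init__(self, n_workers, threshold=0.5, payment=1.0):
        self.threshold, self.payment = threshold, payment
        self.counts = np.zeros(n_workers)
        self.quality_sum = np.zeros(n_workers)

    def active_workers(self, t):
        index = np.where(
            self.counts == 0,
            np.inf,  # untried workers are always given work first
            self.quality_sum / np.maximum(self.counts, 1)
            + np.sqrt(2 * np.log(t + 2) / np.maximum(self.counts, 1)),
        )
        return np.where(index >= self.threshold)[0]

    def record(self, worker, peer_score):
        # peer_score: quality proxy from comparing the worker's report with
        # the aggregate of the other workers' reports (cf. peer prediction).
        self.counts[worker] += 1
        self.quality_sum[worker] += peer_score
        return self.payment  # flat payment per accepted contribution (assumed)

# Toy run: workers whose peer scores stay above the threshold keep working.
learner = SRUCBLearner(n_workers=5)
rng = np.random.default_rng(0)
for t in range(100):
    for w in learner.active_workers(t):
        learner.record(w, peer_score=rng.uniform(0.4, 0.9))
```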
A Bandit Framework for Optimal Selection of Reinforcement Learning Agents
Merentitis, Andreas, Rasul, Kashif, Vollgraf, Roland, Sheikh, Abdul-Saboor, Bergmann, Urs
Deep Reinforcement Learning has been shown to be very successful in complex games, e.g. Atari or Go. These games have clearly defined rules and hence allow simulation. In many practical applications, however, interactions with the environment are costly and a good simulator of the environment is not available. Further, as environments differ by application, the optimal inductive bias (architecture, hyperparameters, etc.) of a reinforcement learning agent depends on the application. In this work, we propose a multi-armed bandit framework that selects, from a set of different reinforcement learning agents, the one with the best inductive bias. To alleviate the problem of sparse rewards, the reinforcement learning agents are augmented with surrogate rewards. This helps the bandit framework to select the best agents early, since these rewards are smoother and less sparse than the environment reward. The bandit has the double objective of maximizing the reward while the agents are learning and selecting the best agent after a finite number of learning steps. Our experimental results on standard environments show that the proposed framework is able to consistently select the optimal agent after a finite number of steps, while collecting more cumulative reward compared to selecting a sub-optimal architecture or uniformly alternating between different agents.
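A minimal sketch of the selection layer, assuming plain UCB1 over the candidate agents: each "pull" runs one training episode, and the feedback can be the smoother surrogate reward rather than the sparse environment reward. The `run_episode` callback and the toy agents are assumptions, not an API from the paper.

```python
import numpy as np

def ucb1_agent_selection(agents, run_episode, n_steps, c=2.0):
    """UCB1 over candidate RL agents. Each pull trains/evaluates one agent
    for an episode; the returned reward may be a shaped surrogate signal.
    Returns the index of the agent recommended after n_steps pulls.
    """
    k = len(agents)
    counts = np.zeros(k)
    means = np.zeros(k)
    for t in range(n_steps):
        if t < k:
            i = t                               # play every agent once
        else:
            bonus = np.sqrt(c * np.log(t) / counts)
            i = int(np.argmax(means + bonus))
        r = run_episode(agents[i])
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]  # running mean update
    return int(np.argmax(means))

# Toy usage: the 'agents' stand in for differently configured learners.
rng = np.random.default_rng(1)
best = ucb1_agent_selection(
    agents=[0.2, 0.5, 0.8],                     # true mean rewards (toy)
    run_episode=lambda mu: rng.normal(mu, 0.1),
    n_steps=300,
)
print("selected agent:", best)
```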